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Abstract 

Background: Relative expression algorithms such as the top-scoring pair (JSP) and the top-scoring triplet (TST) 
have several strengths that distinguish them from other classification methods, including resistance to 
overfitting, invariance to most data normalization methods, and biological interpretability. The top-scoring 
'N' (TSN) algorithm is a generalized form of other relative expression algorithms which uses generic permutations 
and a dynamic classifier size to control both the permutation and combination space available for 
classification. 

Results: TSN was tested on nine cancer datasets, showing statistically significant differences in classification 
accuracy between different classifier sizes (choices of N). TSN also performed competitively against a wide variety of 
different classification methods, including artificial neural networks, classification trees, discriminant analysis, 
k-Nearest neighbor, naive Bayes, and support vector machines, when tested on the Microarray Quality Control II 
datasets. Furthermore, TSN exhibits low levels of overfitting on training data compared to other methods, 
giving confidence that results obtained during cross validation will be more generally applicable to external 
validation sets. 

Conclusions: TSN preserves the strengths of other relative expression algorithms while allowing a much larger 
permutation and combination space to be explored, potentially improving classification accuracies when fewer 
numbers of measured features are available. 

Keywords: Classification, Top-scoring pair. Relative expression. Cross validation. Support vector machine, Graphics 
processing unit, Microarray 



Background 

Relative expression algorithms such as the top-scoring 
pair (TSP) [1] and the top-scoring triplet (TST) [2] repre- 
sent powerful methods for disease classification, primar- 
ily focused on the creation of simple, yet effective 
classifiers. These algorithms have several strengths that 
distinguish them from other classification methods. First, 
only the ranks of the expression data are used, rather 
than the expression values directly, therefore these algo- 
rithms are invariant to data normalization methods that 
preserve rank-order. For example, quantile normalization 
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is a rank-preserving common practice in microarray ana- 
lysis to remove technical sources of variance between 
arrays [3]. It is therefore preferable that the classification 
algorithm be insensitive to such normalization proce- 
dures, particularly in meta-analyses combining data from 
multiple studies or in a clinical setting where additional 
measurements beyond the features used to build the 
classifier would be needed to apply the normalization 
step. Second, relative expression classifiers make use of 
only a few features to build each classifier, and require 
relatively little to no parameter tuning. As a result, the 
algorithms are generally resistant to overfitting, in which 
an algorithm learns to classify the noise of the training 
set rather than the true phenotypic signal of interest. 
Moreover, the small number of features in relative 
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expression algorithms lends itself well to the develop- 
ment of inexpensive clinical tests [4]. Third, an underap- 
preciated aspect of relative expression algorithms involves 
their potential for biological interpretation. The simplicity 
of these algorithms, in which the ranks of a few features 
shift positions in a predictable way between two pheno- 
typic classes, suggests that the features participating in a 
highly accurate classifier may represent or reflect an 
underlying biological role for those features in the pheno- 
types being classified. Relative expression algorithms may 
therefore serve as hypothesis generators for additional 
study. This characteristic may become particularly rele- 
vant as classification methods move increasingly more 
into technologies such as secretomics and miRNA expres- 
sion measurements that, at present, result in fewer mea- 
surements per sample than do transcriptomes. 

In this paper we present a new formulation of the rela- 
tive expression classification algorithm that generalizes 
the idea of pairwise rank comparisons (TSP) and triplet 
rank comparisons (TST) into generic permutation rank 
comparisons, where the size of the classifier is not 
defined a priori. This algorithm is called the top-scoring 
'AT (TSN), where A/ is a variable indicating the size of 
the classifier. As such, TSP and TST can be thought of 
as special cases of the general TSN algorithm (just with 
a fixed N= 2 or N= 3, respectively). Because the classifier 
size is unconstrained, TSN can explore a much larger 
permutation and combination space than that available 
to either TSP or TST. All of the results presented in this 
paper used no more than sixteen features from any of 
the training sets. 

The classification accuracy of the existing relative ex- 
pression algorithms has been demonstrated in several 
studies. Classifiers identified using relative expression 
algorithms have been used to distinguish multiple cancer 
types from normal tissue based on expression data 
[1,2,4,5] as well as to predict cancer outcomes and model 
disease progression [6]. Furthermore, relative expression 
algorithms perform competitively when compared to 
other, often more complex, classification methods, in- 
cluding support vector machines [7], decision trees [8] 
and neural networks [9]. Relative expression algorithms 
have also been applied in a network context, illustrating 
the dysregulation of cellular pathways in disease pheno- 
types [10]. 

We first demonstrate that both TSP and TST are spe- 
cial cases of the TSN algorithm. We illustrate the per- 
formance of a range of TSN classifier sizes on a set of 
nine cancer datasets. Finally, we demonstrate that TSN 
performs competitively when compared to a broad range 
of classification models, including artificial neural net- 
works, classification trees, and support vector machines, 
using data and results from the FDA-sponsored Micro- 
array Quality Control II project (MAQC-II) [11]. 



Methods 

Overview of relative expression algorithms TSP and TST 

Given two classes of samples C={Cj, C2}: for which 
ranked expression data are available on M features 
X={xi,. . .^Mit the TSP algorithm [1] searches for the 
feature pair {x,, Xj] that maximizes the TSP score A^, 
defined as: 

A,;y = \Pr{xi < Xj\C = Ci) — Pr{xi < Xj\C = C2) j 

The TSP algorithm identifies the best pair of features for 
which the rank of Xi falls lower than the rank of Xj in 
most or all samples in class Ci, and the rank of Xt falls 
lower than the rank of Xj in few or no samples of class 
C2. The max (A^ = 1) indicates a perfect classifier on the 
training set in which no samples deviate from this pat- 
tern. Classification is performed by comparing the order- 
ing of features {Xj, xj\ in each sample of the test set to 
the orderings associated with the two classes. A variant 
on this algorithm known as k-TSP makes use of multiple 
disjoint pairs to improve classification accuracy [5]. 

The top-scoring triplet (TST) algorithm [2] extends 
the TSP algorithm to triplets of features. The six unique 
permutations nj,. . .,tt6 of each feature triplet {Xp Xj, X/^} 
are now considered explicitly, where: 

ni = {xi < Xj < Xk} , n2 = {xi < Xk < Xj} , = {xj < x, <Xk} 
ni={xj<Xk <Xi},n5=[xk <Xi<Xj},n:6={xk <Xj<Xi} 

These permutation counts are accumulated for each 
sample of the training set, and the TST score Ay ,t to be 
maximized is then calculated as follows: 

1^11 

^i,i,k = ^Pr(n„^C=Ci^-Pr(^jim\C=C2j^,i ^ i^k 

m— 1 

The top-scoring 'N' algorithm 

The top-scoring 'N' algorithm, as the name implies, 
extends these relative expression algorithms to a generic 
permutation size. Within the context of feature permu- 
tations, TSP and TST can be thought of as special cases 
of the TSN algorithm, where a fixed N=2 and N=3 are 
used, respectively. The TSN algorithm uses a nonstan- 
dard counting system known as factoradics, or factorial- 
radix numbers. Briefly, factoradics can be described as a 
mixed-radix counting system in which the multiplicative 
factor for each digit placeholder is derived from the set 
of factorial numbers. An example of factoradics com- 
pared to two other common fixed-radix counting sys- 
tems is shown in Additional file 1: Figure SI. Given that 
the factoradic counting system is intimately related to 
the factorial numbers, it is perhaps not surprising that 
there is a relationship between factoradics and permuta- 
tions. There exist M permutations of a set of N objects. 
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and therefore each permutation of N objects may be 
represented by an integer from 0 to M-1. Factoradics pro- 
vide a mechanism by which permutations may be 
uniquely represented, and the translation between a per- 
mutation and its corresponding factoradic is known as 
the Lehmer code (Figure 1). Using factoradics, every per- 
mutation has a one-to-one correspondence with a deci- 
mal number. Several examples of permutation-to-decimal 
translations via factoradics are shown in Additional file 1: 
Figure S2. 

The TSN algorithm works as follows: given two classes 
of samples C={Cy, C2] with rank values for M features 
{xj,. . -iXm}) and a classifier size N, the TSN algorithm 
identifies the feature set X={ that maximizes 

the sum of the difference of the permutation probability 
distribution between the two classes: 

TV! 
m— 1 

where tr,„ is the mth permutation of the classifier X. Re- 
call that there are M possible permutations of X. The 
permutation probability distribution for each class is 
determined by mapping the permutation of X for each 
training set sample to its corresponding factoradic, con- 
verting the factoradic to decimal representation, and 



Figure 1 The Lehmer code. A complete translation from 
permutation to decimal, by way of the factoradic, for a permutation 
of size 4. Each permutation is mapped to a single unique decimal 
representation. Two additional translations from permutation to 
factoradic are shown in Additional file 1: Figure S2. 



using this as an index into a histogram of size M. Once 
normalized by the number of samples in each class, the 
histogram represents the permutation probability distri- 
bution for that feature set on that training set class. 
When the two histograms are completely disjoint (i.e., 
there are no overlapping permutations between the two 
classes), the TSN score Ax= 1. 

In addition to the primary TSN score, a secondary 
score y is calculated in the event of ties between two 
classifiers. This is simply the distance in rank between 
the first and last element of the classifier X for each 
sample, summed over all the samples of the training set: 

s 

yx = ^ \Rx{N),i - Rx{l),i\ 
i=l 

where S is equal to the number of samples in the train- 
ing set and N the size of the classifier X. R refers to the 
rank, and X(l) and X{N) are the first and last elements 
of the classifier, respectively. In the case of ties in the 
primary TSN score, the classifier chosen will have the 
largest distance in rank between the upper and lower 
elements of the classifier. 

In the case where N=2, the TSN algorithm simply 
reduces to the TSP algorithm, since X2 = { }, and Pr 
{ai)=Pr{Xi<Xj). In the case where N=3, the TSN algo- 
rithm reduces to the TST algorithm, since X3 = {Xi^x^x^} 
and Pr{am) = Pr (71 m)- Because the TSN algorithm uses 
factoradics to uniquely represent any permutation of any 
size classifier, it allows TSP and TSP classifiers to be 
used in concert as well as allowing for even larger classi- 
fiers to be explored. 

The choice of N is clearly important in the determin- 
ation of a new classifier for a training set. The simplest 
method is to choose the value of N with the greatest 
classification accuracy after iteration over a range of N. 
This method would reveal the apparently most effective 
classifier size. In this case the experimenter is artificially 
choosing the 'best' value of N for a given dataset. How- 
ever, in fair comparisons with other classification meth- 
ods it is important that the choice of N not be made a 
posteriori (once the best classifier and value of N have 
been determined) to avoid overly optimistic error esti- 
mates. We do not choose the value of N outside the 
cross validation loop, but rather dynamically select the 
value of N at each iteration of the cross validation loop; 
the choice is made based on the apparent accuracy of 
that value of N on the training set. We call this version 
of the algorithm dynamic N. Apparent accuracy is calcu- 
lated by first finding the highest scoring classifier on the 
training set for each value of TV in a range specified by 
the user. The value of N with the highest apparent ac- 
curacy on the training set is then applied to the test set. 
In the case of ties in apparent accuracy for multiple 



Convert permutation a = 3, 4, 1, 2 to decimal representation 

Step 1 1 Let / = arg min(c) 3, 4, [%] 2 

Step 2: Count the number of digits to the left of digit / that 
are greater than This is the first digit of the factoradic. 

lP'J^lj^' ^ digits: 2 

Step 3: Remove af/j from the permutation. 3, 4, 2 

Step 4: Perform steps 1-3 until no digits remain. 

3, 4,|l7|2 ► [3,~4]]l,2 Num digits: 2 

3,4,(2] ► [3,[4]]2 Num digits: 2 

[3]4 ► 3, 4 Num digits: 0 

[4~| > 4 Num digits: 0 

Step 5: Convert factoradic 2, 2, 0, 0 to decimal, using the 
factorial numbers as the place for each digit. 

Place: 3! 2! 1! 0! 
Factoradic: 2 2 0 0 

2x31+2x21-1-0x1!+ 0x0! = 16 

The unique decimal representation of permutation 
3, 4, 1, 2 is 16. 
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values of N, the algorithm chooses the smallest tied 
value of N for the classifier at that iteration of the cross 
validation loop. This process is repeated at each iteration 
of the cross validation loop. Note that this method does 
not preclude the user from artificially choosing the best 
value of N (outside of cross validation) for other pur- 
poses, but is rather a mechanism to avoid bias during 
cross validation. This allows us to make fair comparisons 
of the TSN algorithm with other classification methods 
without potentially biasing the results in our favor. 

Classification with TSN 

Once the highest scoring classifier X is identified using a 
training set, prediction on a test set is performed by 
comparing the classifier permutation for each sample of 
the test set to the permutation probability distribution of 
the classifier for each class. A class is predicted for each 
sample based on which permutation probability is higher 
for the permutation of that sample. For example, given a 
classifier size of A/ = 4, if a particular sample in the test 
set contains permutation 16, that sample is classified as 
class 1 or class 2 based on which class has higher per- 
mutation probability for permutation 16 in the training 
set. A special case may occur during classification, where 
the probability for a test set permutation is equal (or zero) 
for both classes. In this case, the algorithm adopts a 
maximum likelihood approach to classify the sample. 
First, all permutations are identified with an inversion 
distance of 1 from the original permutation. The inver- 
sion distance is defined as the number of adjacent 
swaps required to convert one permutation into another 
(Figure 2, top panel). For example, given a classifier of 
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Figure 2 inversions. (Top) There are four inversions required to 
translate the sorted list [12 3 4] into the permutation [3 4 1 2]. The 
sum of the digits of the factoradic give the number of inversions 
required to translate one permutation into another. (Bottom) The 
grey squares indicate the set of permutations that have a single 
inversion distance from the original (black) permutations. 



size 4, each permutation has a set of three permutations 
with an inversion distance of 1. For permutation 16, this 
set includes permutations 13, 17, and 22 (Figure 2, bot- 
tom panel). Once the single-inversion permutation set is 
identified, the permutation probability for this set is 
summed for each class. The class with the higher prob- 
ability is chosen. If the single-inversion distance sums are 
the same between the two classes, the algorithm repeats 
the calculation for the permutations with inversion dis- 
tance 2, and so on. If a choice cannot be made, which 
only occurs if both classes have identical permutation 
probability distributions, that sample is considered an 
incorrect prediction for that iteration of the cross valid- 
ation loop. 



Implementation of TSN 

While the TSN algorithm can theoretically explore a very 
large permutation space, the computational require- 
ments of the algorithm rise very quickly and to avoid 
overfitting the number of permutations explored must 
be scaled to what is reasonable given available sample 
numbers. The complexity of TSN is 0((j^)7V!) , where M 
is the number of features and N is the size of the classi- 
fier. We have previously shown [12] that the graphics 
processing unit (GPU) is highly efficient when applied to 
easily parallelizable algorithms such as TSP and TST. 
Given that TSN preserves the parallel nature of the other 
relative expression algorithms, it is also easily applied to 
the GPU. However, given that GPU hardware is not yet 
widely available to many researchers, we are releasing 
the source code for both GPU and CPU implementations 
of the TSN algorithm. TSN has been implemented for 
both the GPU and the CPU in the MATLAB computing 
environment. 

The GPU is a specialized hardware device normally 
used in graphics rendering. The nature of graphics ren- 
dering involves large numbers of vector and matrix 
operations performed in real-time, thus the GPU archi- 
tecture emphasizes massive parallelism. Driven by the 
billion-dollar gaming industry, the GPU has developed 
into a powerful tool currently able to reach over 1 TFLOP 
(trillion floating point operations per second) on a single 
chip in single precision operations. With NVIDIA's 
release of the Compute Unified Device Architecture 
(CUDA) in 2007, general-purpose computation on the 
GPU became accessible. GPUs are increasingly being ap- 
plied to computationally intensive scientific problems, in- 
cluding molecular dynamics simulations [13], weather 
prediction [14], quantum chemistry [15], bioinformatics 
[16], and medical imaging [17]. Plots of running times 
for N=2, N=3, and N=4: over a wide range of input fea- 
ture sizes are shown in Figure 3. The speedup of the 
GPU over the CPU implementations of TSN improves as 
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Number of input features 

Figure 3 GPU vs. CPU running times. Running times for N = 2, N=3, and A/ = 4 over a range of input feature sizes. Each point is the mean of 
three independent runs of the software. The CPU running time for /V = 2 over 20,000 features is similar to the running times for W = 3 over 
1000 features and N = 4 over 200 features. The CPU version of TSN was run on a single core of a 2.4 GHz Intel Core 2 processor. The GPU version 
of TSN was run on an NVIDIA Tesia C2050. The speedup due to the GPU improves as the value of N gets higher: for N = 2, the speedup is 2.3X, 
for N=i the speedup is 2.8X, and for N = 4 the speedup is 4.4X. Running times reflect a single iteration of the algorithm and do not include 
multiple iterations such as cross validation. Note that running times are also a function of the number of samples in the dataset; there were 
70 samples in this dataset 

^ J 



the value of N gets higher, ranging from 2.3X for N=2 to 
4.4X for TV = 4. Pseudocode for the operation of the core 
TSN algorithm is shown in Additional file 1: Figure S3. 
All source code for both the CPU and GPU implementa- 
tions is freely available on http://price.systemsbiology. 
net/ downloads.php. 

Results and discussion 

Multiple values of N 

TSN has been tested on nine cancer datasets that were 
used in the previous k-TSP and TST papers [2,5] for 
comparison between different values of A'^. These data- 
sets represent a wide range of cancers, including colon 
[18], leukemia [19], central nervous system lymphoma 
(CNS) [20], diffuse large B-cell lymphoma (DLBCL) [21], 
prostate [22-24], and a global cancer map (GCM) data- 
set [25]. As discussed in the methods section, the TSN 
algorithm can be used in two different ways: the choice 
of N can be made a posteriori after all fixed values have 
been tested, or the choice of N can be made at each iter- 
ation of the cross validation loop {dynamic N) using ap- 
parent accuracy. Apparent accuracy is calculated by first 
finding the highest scoring classifier on the training set 
for each value of A/in a range specified by the user. The 
value of N with the highest apparent accuracy on the 
training set is then applied to the test set. In order to 



directly compare the accuracies based on the number of 
permutations of features, we chose 16 features for AT =2, 
10 features for N=3, and 9 features for AT =4. This 
results in approximately 120 combinations for each 
value of N. The reason for choosing different numbers of 
features for each value of N is to equalize the combin- 
ation space for each classifier size. For example, a classi- 
fier of size N=2 given 16 features can explore 2! = 2 

permutations over ^2^^ = 120 combinations. A classifier 
of size N=3 given 10 features can explore 3! = 6 permu- 
tations over ^3"^ = 120 combinations. A classifier of size 
AT =4 given 9 features can explore 4! = 24 permutations 
over = 126 combinations. As a result, any difference 

in accuracy between these two classifiers depends pri- 
marily on the permutation space being explored and not 
the combination space (which is held relatively constant). 
The features were chosen to be the most differentially 
expressed genes based on the Wilcoxon rank sum test, 
again selected within each iteration of the cross valid- 
ation loop to avoid overly optimistic estimates. 

Shown in Figure 4 are the results of TSN being applied 
to three of the cancer datasets with fixed values of N as 
well as dynamic N using 5-fold cross validation. To deter- 
mine statistically significant differences between values 
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Leukemia 



DLBCL 



Lung 




=3 N=4 Dyn. N N=2 N=3 N=4 Dyn. N N=2 N=3 N=4 Dyn. N 

Figure 4 Results of TSN classification on cancer datasets. Results of 100 rounds of 5-fold cross validation over a range of /V = {2,3,4} where the 
number of differentially expressed probes is different for each value of W {16,10,9). This yields approximately the same number of possible 
combinations for each value of N (-120), illustrating how classification accuracy can be determined by the permutation itself not just the number 
of combinations available. Results shown include accuracies of fixed values of N as well as the dynamic N algorithm described in the methods 
section. Statistical differences were calculated using the nonparametric Krusl<al-Wallis one-way analysis of variance by ranks, and a p-value<0.05 
was considered significant. If bars share the same letter they are not statistically different. The datasets are derived from [2] and represent a wide 
range of cancers. Significance plots for all nine cancer datasets are in Additional file 1: Figure S4. 



of N, we ran 100 iterations of 5-fold cross validation on 
each of the nine cancer datasets. Each iteration of cross 
validation randomly selected different training and test 
sets, allowing us to measure the distribution of accur- 
acies for each value of N. This was done for fixed N=2, 
fixed A/'=3, fixed Af=4, and dynamic 7V= {2,3,4} as 
described above. Because the resulting distributions of 
accuracies failed a Kolmogorov-Smirnov normality test, 
we used the non-parametric Kruskal-Wallis one-way 
analysis of variance by ranks to measure differences be- 
tween the groups. A p-value < 0.05 was considered sig- 
nificant. Significant differences are indicated by letters 
above each bar; if two bars share the same letter they are 
not statistically different. Significance plots for all nine 
cancer datasets are shown in Additional file 1: Figure S4. 
All raw data is included in Additional file 2. 

It is clear from Figure 4 that the value of N can have a 
significant effect on the resulting accuracy of the classi- 
fier, which indicates that the larger permutation space 
afforded by larger values of N can be useful in identify- 
ing an effective classifier. In the Leukemia dataset, for 
example, N=2 and N='i produced the apparently most 
effective classifiers; in the Lung dataset, N=3> and Af=4 
were the apparent best. In four of the nine datasets 
(DLBCL, Prostate2, ProstateS, and GCM), dynamic N 
yielded no significant difference in accuracy with the 
highest-scoring fixed value of N. In two additional data- 
sets (Leukemia and Lung), the dynamic N accuracy is 
statistically in between the highest- and lowest-scoring 
values of N. In the remaining three datasets (Colon, 
CNS, and Prostatel), the dynamic N accuracy is not sig- 
nificantly different from the lowest-scoring fixed value of 
N. The dynamic N TSN result is the fair estimate of 
how well the algorithm would be expected to perform 



with optimization for TV, without the bias that is intro- 
duced by choosing the apparently best N after the error 
estimate has been made. 

Microarray quality control II datasets 

Published in 2010, the Microarray Quality Control II 
dataset (MAQC-II) [11] was produced by the National 
Center for Toxicological Research at the United States 
Food and Drug Administration in collaboration with 96 
universities and companies from around the world. One 
goal of the project was to build a set of microarray data 
that could be used to validate classification methods in a 
rigorous and systematic manner. To this end, six differ- 
ent microarray datasets representing a range of pheno- 
types, microarray platforms, and sample sizes were 
selected by the consortium. Each dataset was partitioned 
into one or more endpoints, where an endpoint repre- 
sents a class partition to be predicted. A total of thirteen 
endpoints were represented by the six datasets. Each 
endpoint consisted of a training set as well as an inde- 
pendently collected validation set. A listing of the 
MAQC-II datasets and endpoints used in this study is 
provided in Table 1. Note that only five of the datasets 
representing nine endpoints are currently available for 
public download from the Gene Expression Omnibus 
(GSE16716). We tested TSN on all endpoints for which 
data was available. Thirty-six independent groups using 
a variety of classification methods, including artificial 
neural networks, classification trees, discriminant ana- 
lysis, k-Nearest neighbor, naive Bayes, and support vec- 
tor machines, analyzed the MAQC-II training sets and 
provided nearly 20,000 models to the MAQC-II consor- 
tium. It should be noted the groups were not restricted 
to a single classification method, and many chose to use 
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Table 1 The five MAQC-II datasets, representing endpoints A through I that are available from the Gene Expression 
Omnibus 



Dataset 


Endpoint 


Description 


Platform 


Hamner 


A 


Lung tumorigen vs. non-tumorigen 


Affymetrix Mouse 430 2.0 


Iconix 


B 


Non-genotoxic liver carcinogens vs. non-carcinogens 


Amersham Uniset Rat 1 Bioarray 


NIEHS 


C 


Liver toxicants vs. non-toxicants 


Affymetrix Rat 230 2.0 


Breast Cancer 


D 


Pre-operative treatment response 


Affymetrix Human U133A 




E 


Estrogen receptor status 




Multiple Myeloma 


F 


Overall survival milestone outcome 


Affymetrix Human U133 Plus 2.0 




G 


Event-free survival milestone outcome 






H 


Gender of patient (positive control) 








Random class labels (negative control) 





different methods for the different endpoints based on 
what they determined would be the most successful. As 
a result, our single classification method, TSN, is being 
compared against ensembles of methods by most 
MAQC-II participants. Each group ultimately nominated 
a single model from each endpoint training set to be 
tested on the corresponding validation set, and these 
models were compiled into a list of final predictions. To 
further test the classification algorithms, the MAQC-II 
consortium swapped the training and validation sets for 
each endpoint, and each group submitted predictions for 
the swapped datasets. TSN was tested against the groups 
that submitted validation set predictions for every 



available endpoint on both original and swapped data. A 
listing of the participants and their respective classifica- 
tion methods used in this paper is provided in Table 2. 

The metric chosen by the MAQC-II consortium to 
rate the classification models was the Matthew's Correl- 
ation Coefficient (MCC). The MCC has several advan- 
tages over the accuracy/sensitivity/ specificity standard, 
as it is able to detect inverse correlations as well as being 
sensitive to the overall size of the training sets. MCC 
values range from -i-l (perfect prediction) to -1 (perfect 
inverse prediction), with 0 indicating random prediction. 
Note that unbeknownst to the original study partici- 
pants, endpoints H and I were replaced by a positive 



Table 2 The participants that submitted models for every endpoint (original and swap) in the MAQC-II study, and the 
classification methods used 



Code Name Classification algorithm(s) used 



CAS 


Chinese Academy of Sciences 


NaTve Bayes, Support Vector Machine 


CBC 


CapitalBio Gorporation, Ghina 


k-Nearest Neighbor, Support Vector Machine 


Cornell 


Weill Medical Gollege of Gornell University 


Support Vector Machine 


FBK 


Fondazione Bruno Kessler, Italy 


Discriminant Analysis, Support Vector Machine 


GeneGo 


GeneGo, Inc. 


Discriminant Analysis, Random Forest 


GHI 


Golden Helix, Inc. 


Classification Tree 


GSK 


GlaxoSmithKline 


Naive Bayes 


NCTR 


National Center for Toxicological Research, FDA 


k-Nearest Neighbor, NaTve Bayes, Support Vector Machine 


NWU 


Northwestern University 


k-Nearest Neighbor, Classification Tree, Support Vector Machine 


SAI 


Systems Analytics, Inc. 


Discriminant Analysis, k-Nearest Neighbor, Machine Learning, 
Support Vector Machine, Logistic Regression 


SAS 


SAS Institute, Inc. 


Classification Tree, Discriminant Analysis, Logistic Regression, 
Partial Least Squares, Support Vector Machine 


Tsinghua 


Tsinghua University, China 


Classification Tree, k-Nearest Neighbor, Recursive Feature Elimination, 
Support Vector Machine 


UlUC 


University of Illinois, Urbana-Champaign 


Classification Tree, k-Nearest Neighbor, NaTve Bayes, 
Support Vector Machine 


USM 


University of Southern Mississippi 


Artificial Neural Network NaTve Bayes, Sequential Minimal Optimization, 
Support Vector Machine 


ZJU 


Zejiang University, China 


k-Nearest Neighbor, Nearest Centroid 
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control (gender of the study participants) and a negative 
control (random class assignments), respectively. There- 
fore, it was expected that endpoint H would result in 
very high prediction MCC and endpoint I would result 
in MCC close to zero. The MCC is calculated as follows: 



MCC: 



TPx TN-FPx FN 



^{TP + FP){TP + FN) ( TN + FP) ( TN + FN) 



If any of the sums in the denominator of the MCC are 
zero, the denominator is set to be one, resulting in an 
MCC equal to zero. 

As stated above, only five of the six MAQC-II datasets 
are currently available from GEO, therefore we were 
only able to compare TSN to these datasets. All filtering 
and classification was performed using only the training 
data for each dataset - the validation set was left com- 
pletely out of these calculations. Where possible (Affy- 
metrix platforms), the features of each training set were 
first filtered for a high percentage (66%) of present or 
marginal calls using a MATLAB implementation of the 
Affymetrix MAS5 call algorithm [26]. The most differen- 
tially expressed probes for each training set were identi- 
fied using the TSN implementation of the Wilcoxon 
rank sum test. Finally, the dynamic N TSN algorithm 
was used to identify the highest-scoring classifier on the 
training set over a range of N = {2,3,4} and DEC = 



{16,10,9}. As described in the methods section, the algo- 
rithm was allowed to select the best value of N using ap- 
parent accuracy of the training set. The highest scoring 
classifier was then applied to the validation set for each 
endpoint. The results of the TSN algorithm models ap- 
plied to each endpoint validation set in the context of all 
analyzed participants are shown in Figure 5. All raw data 
is included in Additional file 3. The mean MCC value 
across all endpoints (excluding endpoint I, the negative 
control) was also calculated for each participant, and is 
shown in Figure 5. TSN performs competitively on these 
datasets, yielding a mean MCC value across all endpoints 
of 0.444. The maximum mean MCC value achieved by 
any of the groups was SAI, with 0.489. 

In addition to standard cross validation and validation 
set MCC, we also measured the statistical significance of 
different classifier sizes. As described with the cancer 
datasets above, we ran 100 iterations of TSN using fixed 
values of N=2, N=3, and TV =4, as well as dynamic 
N= {2,3,4} on all nine of the MAQC-II training sets. For 
example, in endpoints A and B, 7\T= 4 yields a statistically 
significant improvement over smaller classifier sizes. For 
endpoints C and E, AT = 2 is the most effective classifier 
size. For endpoint G, there was no significant difference 
between any of the classifier sizes. In six out of the nine 
datasets (endpoints A, C, F, G, H, and I) there was no 
significant difference in MCC between dynamic N and 
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Figure 5 Results of TSN classification on iVlAQC-ll datasets. MCC of MAQC-II endpoints A through I, based on models learned on the training 
set and then applied to the validation set. MCC values range from 4-1 (perfect prediction) to -1 (perfect Inverse prediction), with 0 Indicating 
random prediction. Boxplots show the MCC distribution of the models from the 15 groups, Including TSN, that predicted all original and swap 
endpoints from the MAQC-II. The original and swap MCC values are averaged for each group. In addition to endpoints A through I, a boxplot 
showing the mean MCC over endpoints A through H Is shown (ALL). We exclude endpoint I from this final boxplot because It Is a negative 
control. The bottom and top of each box Indicate the lower and upper quartlles of the data, respectively. The middle line represents the 
median. The whiskers indicate the extreme values. The asterisk represents the performance of TSN on that dataset. All raw data is included In 
Additional file 3. 
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the highest-scoring fixed value of N. The complete 
results are available in Additional file 1: Figure S5. All 
raw data is included in Additional file 4. 

In order to test the amount of overfitting, we calculated 
the difference of the MCC values from each validation 
set and the corresponding MCC values from training set 
cross validation for each group. The cross validation per- 
formed for TSN was 5-fold cross validation, repeated 10 
times, as recommended by the MAQC-II consortium. 
These results are presented in Figure 6 as boxplots show- 
ing the distribution of AMCC values. To prevent negative 
and positive values canceling each other out, the absolute 
value of each AMCC was used. Both original and swap 
datasets were included in the calculation of AMCC. TSN 
has a mean AMCC = 0.101, ranking second after SAS for 
the lowest AMCC of any of the MAQC-II participants - 
demonstrating that TSN had a remarkably low overfitting 
to the data. 

For all analyses in this paper, up to sixteen differen- 
tially expressed genes were selected by the Wilcoxon 
rank sum test to input into the TSN algorithm. The fact 
that so few features were input to TSN in these analyses 
could explain the low levels of overfitting it exhibits. To 
test this, we ran all MAQC-II training sets (except for 
the negative control endpoint I, which would bias the 
results of AMCC towards zero) over a range of input 
feature sizes. For N=2, we input a range of 16 to 10,000 



input features. For N= 3 we input a range of 10 to 670 
input features. For A/^ = 4 we input a range of 9 to 188 in- 
put features. These numbers were chosen to span approxi- 
mately the same range of possible feature combinations 
for each value of N (approximately 120 combinations up 
to 50 million combinations). Finally we ran dynamic N 
for A/^= {2,3,4} over the same ranges of input feature 
sizes. AMCC values were calculated for each input fea- 
ture size, and box plots of their distributions are shown 
in Additional file 1: Figure S6. All raw data is included in 
Additional file 5. While the mean AMCC values do in- 
crease as a function of input feature size, overall the 
levels of overfitting remain low for TSN despite the in- 
crease. The mean AMCC exhibited by dynamic N TSN 
at the largest input size of [10000, 760, 188], is 0.148. 
This is still among the smallest mean AMCC value 
observed in any of the participating groups; only three 
groups are smaller (GHI, GSK, and SAS). 

Conclusions 

The goal of relative expression classification algorithms 
is to identify simple yet effective classifiers that are resist- 
ant to data normalization procedures and overfitting, 
practical to implement in a clinical environment, and po- 
tentially biologically interpretable. The top-scoring 'N' al- 
gorithm presented here retains these desirable properties 
while allowing a larger combination and permutation 
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Figure 6 AMCC Results from MAQC-II data. Boxplots showing the distribution of AMCC values on the original data for each group, where 
AMCC = Cross Validation MCC - Validation Set MCC. This illustrates the amount of overfitting present during cross validation. The absolute 
value of each AMCC value was used in the calculations. The cross validation performed for TSN was 5-fold cross validation, repeated 1 0 times, 
as recommended by the MAQC-II consortium. Boxplots are sorted by the mean AMCC for each group (asterisk). All raw data is included in 
Additional file 3. 



Magis and Price BMC Bioinformatics 2012, 13:227 
http://www.biomedcentral.com/1471-2105/13/227 



Page 1 0 of 1 1 



space to be searched than that afforded by earlier relative 
expression algorithms such as TSP and TST. TSN can 
also recommend the classifier size (N) most likely to re- 
sult in effective classification based on the training set. 
Of course, more care must be taken to avoid overfitting 
with TSN, particularly on smaller datasets, given that the 
permutation space grows with the factorial numbers. 
However, the problem of overfitting can be well miti- 
gated by choosing a suitably small number of features 
from which to build the classifier, or ensuring that the 
number of samples available is large enough to justify 
searching a larger combination space. All the results pre- 
sented in this paper were performed using between nine 
and sixteen features of the microarray datasets. TSN is 
therefore well suited for datasets of emerging technolo- 
gies that contain smaller numbers of features to begin 
with, such as secretomics and miRNAs. However, as 
Figure 3 demonstrates, it is still possible to search tens of 
thousands of permutations in a relatively short amount 
of time, when justified by large sample sizes. The statis- 
tical significance of the resulting classifiers can then be 
determined though e.g. permutation tests of the class 
labels. 

We have demonstrated the effectiveness of TSN in 
classification of the MAQC-II datasets in comparison 
with many other classification strategies, including artifi- 
cial neural networks, classification trees, discriminant 
analysis, k-Nearest neighbor, naive Bayes, and support 
vector machines, as implemented by several universities 
and companies from around the world. We do not claim 
that TSN is necessarily the best or most effective classi- 
fier for every circumstance. For example, TSN performs 
relatively poorly on endpoint H, which as the positive 
control in which classes were simply assigned as the 
gender of the study participants, should be among the 
easiest to classify. A major strength of the algorithm is 
the level to which the MCC values for cross validation 
agree with the MCC values on the independent valid- 
ation set (AMCC). Importantly, these results indicate a 
very low level of overfitting, and increase our confidence 
that results generated through cross validation on future 
datasets will be effective classifiers on independent valid- 
ation sets. That is, when TSN works on a dataset it is 
relatively more likely to be true, and conversely, when it 
is going to fall short in independent validation it typic- 
ally does not work well in cross validation and so can be 
discarded as a candidate diagnostic early in the process. 
Analyses over a range of input sizes indicate that overfit- 
ting remains low even as input feature numbers increase, 
given sufficient sample sizes. 

Of all the MAQC-II participants, including TSN, 
group SAS yielded the lowest mean AMCC score 
(0.074), indicating low levels of overfitting. Group SAI 
yielded the highest mean MCC (0.4893) for original and 



swap datasets, indicating high levels of validation set ac- 
curacy based on the training set. Both of these groups 
utilized multiple classification strategies across all end- 
points. For example, group SAS used logistic regression 
for endpoints A, E, and I, support vector machines for 
endpoints B, G, and H, partial least squares regression 
for endpoints D and F, and a decision tree for endpoint C. 
Group SAI used support vector machines for endpoints 
A, B, E, F, G, and I, k-nearest neighbor for endpoints C 
and H, and a machine learning classifier for endpoint D. 
Group SAI also used a range of different feature selec- 
tion methods for each endpoint. Both groups also used 
different classification strategies for the swap datasets. 
For example, group SAS used logistic regression for the 
original endpoint E data but partial least squares regres- 
sion on swap endpoint E. Group SAI used a machine 
learning classifier for the original endpoint D, and dis- 
criminant analysis for swap endpoint D [11]. As a result, 
TSN is not only being compared to different classifica- 
tion strategies, but an ensemble of classification strat- 
egies that were chosen in an attempt to maximize 
success for each endpoint across both original and swap 
datasets. Given its advantages of relative simplicity, bio- 
logical interpretability, and low levels of overfitting, the 
TSN algorithm can serve as a useful tool for hypothesis 
generation, particularly as next generation sequencing 
and proteomics technologies yield increasing sensitivity 
in biomolecule measurements. 

Additional files 



Additional file 1: Figure S1. Counting systems. Figure S2: Three 
complete conversions from permutation to decimal. Figure S3: 
Pseudocode for the core operation of the TSN algorithm on the 
GPU. Figure S4: Cancer dataset statistical tests for differences 
between values of W. Figure S5: MAQC-II Statistical tests for 
differences between values of N. Figure S6: AMCC box plots for 
different input sizes on the MAQC-II dataset. 

Additional file 2: Raw data for statistical significance testing of 
cancer results referenced in Figure 4. 

Additional file 3: Raw data for cross validation and test set MCC 
scores and AMCC scores for all MAQC-II participants and TSN, 
referenced in Figures 5 and 6. 

Additional file 4: Raw data for statistical significance testing of 
MAQC-II results referenced in Figure S5. 

Additional file 5: Raw data for AMCC values over a range of input 
feature sizes referenced in Figure S6. 
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